MicroRNAs play critical roles in the development and progression of various diseases. Predicting potential
miRNA-disease associations from the vast amount of biological data is an important problem in biomedical
research. Considering the limitations of previous methods, we developed Regularized Least Squares for
MiRNA-Disease Association (RLSMDA) to uncover the relationship between diseases and miRNAs.
RLSMDA can work for diseases without known related miRNAs. Furthermore, it is a semi-supervised method
(it does not need negative samples) and a global method (it prioritizes associations for all the diseases
simultaneously). Based on leave-one-out cross validation, the AUC values obtained demonstrate the reliable
performance of RLSMDA. We also applied RLSMDA to Hepatocellular cancer and Lung cancer and implemented
global prediction for all the diseases simultaneously. As a result, 80% (Hepatocellular cancer) and 84% (Lung
cancer) of the top 50 predicted miRNAs and 75% of the top 20 potential associations based on global prediction
have been confirmed by biological experiments. We also applied RLSMDA to diseases without known
related miRNAs in the golden standard dataset. As a result, among the top 3 potentially related miRNAs
predicted by RLSMDA for 32 diseases, 34 disease-miRNA associations were successfully confirmed by
experiments. It is anticipated that RLSMDA will be a useful bioinformatics resource for biomedical research.

MicroRNAs (miRNAs) are a class of small endogenous single-stranded non-coding RNAs (~22 nt),
which normally suppress gene expression and protein production post-transcriptionally by base pairing
to the 3' untranslated regions (UTRs) of their target messenger RNAs (mRNAs) 1-4 . In some cases,
miRNAs may also function as positive regulators 5,6 . It has been demonstrated that many miRNAs are highly
conserved 7 . In particular, some of them are even lineage specific. After the discovery of the first two
well-known miRNAs (Caenorhabditis elegans (C. elegans) lin-4 and let-7) by conventional forward genetic
screens 8-10 , thousands of miRNAs (for example, more than 1400 miRNAs in human according to miRBase 11 )
have been discovered in eukaryotic organisms ranging from nematodes to humans in the past few years 12 .
It is estimated that 1-4% of the genes in the human genome are miRNAs 13 . MiRNAs recognize their targets
primarily through sequence complementarity between the seed region of the miRNA and the binding sites on
its target mRNAs 14 . It has been conjectured that a single miRNA can regulate as many as 200 mRNAs 13 and
that about one third of human genes can be targeted by miRNAs 12,15 . Therefore, one miRNA can regulate many
target genes and one target gene can be targeted by multiple miRNAs 15 . These miRNA-mRNA interactions form
an important post-transcriptional regulatory network which plays critical roles in various biological
processes 16-19 . It has been observed that miRNA-mediated regulations are evolutionarily conserved 19-21
and hence typically rare sequence variants that disrupt miRNA regulations are often related to human
diseases 19,22-24 .

Accumulating evidence indicates that miRNAs are among the most important components of the cell, playing
critical roles in many significant biological processes, including the development 25 , proliferation 26 ,
differentiation 27 , and apoptosis 28 of the cell, signal transduction 16 , viral infection 27 and so on.
Therefore, the dysregulation of miRNAs is related to many diseases and plays important roles in the
development, progression 13,29,30 , prognosis, diagnosis, and treatment response evaluation of human
disease 31-38 .

Especially in the last few years, many studies have demonstrated that numerous miRNAs are associated with
the initiation and development of various cancers and cancer-related processes 39-42 . Abnormality of miRNAs
leads to the dysfunction of downstream target genes, which can in turn lead to the development of cancer 42 .
MiRNAs have become an important part of the field of human molecular oncology 40 . Another well-known example
is that mir-375 can regulate insulin secretion 43,44 . Therefore, identifying disease-related miRNAs is one of
the most important goals of biomedical research, which can benefit the understanding of disease
pathogenesis at the molecular level and the design of molecular tools for disease diagnosis, treatment
and prevention 31-34,36,45,46 . Searching for disease-miRNA associations through experimental methods is
expensive and time-consuming 45,46 . Encouragingly, plenty of biological data about miRNAs has been
generated. Therefore, there is a strong incentive to develop powerful computational methods for predicting
potential disease-related miRNAs on a large scale 47 . Computational methods are an essential complementary
means of prioritizing disease-related miRNAs, which can benefit the understanding of miRNA function,
decrease the number of biological experiments, and select the most promising miRNAs for further
experimental validation 45,47 .

To provide a comprehensive resource of experimentally verified miRNA-disease associations, Lu, et al. 30
and Jiang, et al. 48 successively constructed two publicly available and manually curated databases,
i.e. the Human MicroRNA Disease Database (HMDD) and miR2Disease. Focusing on cancer-related miRNAs,
Yang, et al. 49 developed a manually curated database of Differentially Expressed MiRNAs in human Cancer
(dbDEMC). The establishment of these disease-related miRNA databases laid a solid data foundation for
predictive research. Lu, et al. 30 integrated and analyzed these disease-miRNA associations to obtain some
important patterns between human diseases and miRNAs, which not only benefited the understanding of human
diseases at the miRNA level, but also laid a solid theoretical foundation for the identification of novel
disease-related miRNAs. The most important conclusion of that paper is that miRNAs related to
phenotypically similar diseases tend to be functionally related, which has been treated as the basic
assumption of many current disease-miRNA association prediction methods 30 .

Some bioinformatics methods have been developed for predicting novel disease-miRNA associations, mostly
based on the aforementioned assumption from the literature 30 . Jiang, et al. 45 extended previous disease
gene prioritization methods and developed a computational model based on the hypergeometric distribution
to prioritize the entire microRNAome for a disease of interest. This method integrated the miRNA functional
interaction network, the disease similarity network, and the known phenome-microRNAome network constructed
from miR2Disease. However, this method only adopts a local similarity measure and strongly relies on
predicted miRNA-target interactions, which have high rates of false-positive and false-negative
results. Other limitations lie in the construction of the miRNA functional similarity network (two miRNAs
may be functionally related when their target genes are located in the same functional modules or pathways,
rather than when they share a significant number of common target genes) and in the use of the disease
phenotypic similarity network (only the information of whether or not two phenotypes are similar was used,
rather than similarity scores). As a result, the prediction accuracy of this method is not high. Based on
the assumption that most miRNAs associated with a given disease regulate genes associated with this
disease, or genes functionally related to these known disease genes, Jiang, et al. 50 proposed a
computational method based on genomic data fusion in the framework of naive Bayes. Recently, Shi et al. 51
developed a computational framework to identify miRNA-disease associations by focusing on the functional
link between miRNA targets and disease genes in protein-protein interaction networks. These two methods
strongly relied on known disease-gene associations and miRNA-target interactions. However, the molecular
bases for as many as 60% of human diseases are unknown. The problems with miRNA-target interactions have
also limited the application of these methods.

Jiang, et al. 46 and Xu, et al. 40 each extracted different feature vectors and developed a support vector
machine classifier to distinguish positive disease miRNAs from negative ones. As is well known, selecting
negative disease-related miRNAs is currently difficult or even impossible. Hence, these methods selected
unlabeled disease-miRNA interactions as negative samples, which would largely influence the predictive
accuracy. Based on the assumption that global network similarity measures are better suited to capturing
the associations between diseases and miRNAs than traditional local network similarity, Chen, et al. 47
first adopted global network similarity and developed the method of Random Walk with Restart for
MiRNA-Disease Association (RWRMDA). Also, Xuan et al. 52 developed a new prediction method, HDMP, based on
the weighted k most similar neighbors, by calculating the functional similarity between miRNAs from the
information content of disease terms and the phenotype similarity between diseases, and by assigning higher
weights to members of the same miRNA family or cluster. RWRMDA and HDMP obtained excellent predictive
accuracy based on cross validation and case studies. However, they do not work for diseases without any
known associated miRNA. Furthermore, the selection of the parameter k is critical to the performance of
HDMP, and different values of this parameter are needed when different diseases are investigated.
Recently, Chen and Zhang 53 adopted the method of Network-Consistency-Based Inference (Net-CBI) to infer
potential disease-miRNA associations based on the idea of network consistency and the integration of the
miRNA functional similarity network, the disease similarity network and known miRNA-disease associations.
Although Net-CBI can work for diseases not linked with any known miRNAs, its performance is significantly
worse than that of RWRMDA based on cross validation.

Taken together, the limitations of previous methods are summarized as follows. Firstly, some methods rely
strongly on incomplete and inaccurate datasets such as miRNA-target interactions and disease-related genes;
secondly, some methods need negative disease-miRNA associations; thirdly, although methods such as
RWRMDA have obtained reliable predictive accuracy, they cannot predict novel miRNAs for diseases which do
not have any known associated miRNAs; finally, methods such as Net-CBI can work for diseases without known
related miRNAs, but their performance is unsatisfactory. To solve these problems, we developed
the method of Regularized Least Squares for MiRNA-Disease Association (RLSMDA), which integrates known
disease-miRNA associations, the disease-disease similarity dataset, and the miRNA-miRNA functional
similarity network to uncover potential disease-miRNA associations. RLSMDA can predict novel miRNAs for
diseases which do not have any known related miRNAs. More importantly, it is developed in the framework of
a semi-supervised classifier, so it does not need negative miRNA-disease associations. Furthermore, unlike
RWRMDA, RLSMDA is a global approach which can reconstruct the missing associations for all the diseases
simultaneously. Cross validation, case studies of several important diseases, global prediction for all
the diseases simultaneously, and independent prediction for diseases without any known related
miRNAs have demonstrated the advantages of RLSMDA over previous methods.

Results 

Leave-one-out cross validation. Here, we implemented LOOCV on known experimentally verified miRNA-disease
associations to evaluate the predictive performance of RLSMDA. To our knowledge, RWRMDA 47 , HDMP 52 , and
the global network algorithm developed by Shi et al. 51 are the state-of-the-art approaches in computational
research on disease-related miRNA prediction. However, the global network algorithm developed by
Shi et al. 51 focused on the functional connectivity between miRNA targets and disease genes in the PPI
network. Therefore, this method integrated disease-gene associations, miRNA-target interactions, and
protein interactions, which are totally different from the datasets used in RLSMDA. Furthermore, this
method did not use known disease-miRNA associations, so the cross validation implemented in this paper,
which splits known samples into test samples and training samples, cannot be implemented for this method.
Therefore, the performance of this method and RLSMDA could not be compared in a fair and reasonable way.
Based on the above considerations, we compare the performance of RLSMDA with RWRMDA and HDMP.

Figure 1 | Method comparison. (Left) Comparison between RLSMDA and RWRMDA, proposed by Chen, et al. 47 ,
in terms of ROC curve and AUC based on local leave-one-out cross validation on 1394 known experimentally
verified miRNA-disease associations. RLSMDA obtained performance comparable to RWRMDA in the local LOOCV,
while RWRMDA cannot predict disease-related miRNAs for diseases without known related miRNAs or for all
the diseases simultaneously. RLSMDA overcomes these two critical shortcomings of RWRMDA. (Right) Comparison
between RLSMDA and HDMP in terms of global LOOCV. RLSMDA and HDMP obtained AUCs of 0.9511 and 0.9431,
respectively. Although only a slight improvement was obtained here, RLSMDA can predict potential miRNAs
for diseases which do not have known related miRNAs, which overcomes the most critical limitation of HDMP.
The performance of RLSMDA could be further improved by introducing miRNA family and cluster information,
as was done in HDMP.

For simplicity, we choose η_M = 1, η_D = 1 for the trade-off parameters in the cost functions according to
previous literature 54 and the weight parameter w = 0.9 in the final classifier, considering the fact that
miRNA functional similarity plays a critical role in disease-related miRNA prediction, as shown by the
method of RWRMDA. Both the trade-off parameters in the cost functions and the weight parameter in the
final classifier could be better selected by further cross validation.
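
The roles of the two trade-off parameters and the weight w can be illustrated with a minimal sketch. It assumes the standard closed-form regularized least squares solution in each similarity space, F_M = S_M (S_M + η_M I)^-1 A^T and F_D = S_D (S_D + η_D I)^-1 A, combined as F = w F_M^T + (1 - w) F_D. This form is an assumption made for illustration, since the cost functions themselves are not given in this section; the function name rls_scores, the argument names, and the toy matrices are hypothetical.

```python
import numpy as np

def rls_scores(A, S_D, S_M, eta_d=1.0, eta_m=1.0, w=0.9):
    """Combine two regularized-least-squares classifiers (illustrative sketch).

    A   : (n_diseases, n_mirnas) 0/1 matrix of known associations
    S_D : (n_diseases, n_diseases) symmetric disease similarity matrix
    S_M : (n_mirnas, n_mirnas) symmetric miRNA functional similarity matrix
    """
    n_d, n_m = A.shape
    # Classifier in the miRNA space: one column of scores per disease.
    F_M = S_M @ np.linalg.solve(S_M + eta_m * np.eye(n_m), A.T)
    # Classifier in the disease space: one column of scores per miRNA.
    F_D = S_D @ np.linalg.solve(S_D + eta_d * np.eye(n_d), A)
    # Weighted combination; F[i, j] scores disease i against miRNA j.
    return w * F_M.T + (1.0 - w) * F_D

# Toy data (assumed values, for illustration only): 2 diseases, 3 miRNAs.
A = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 0.0]])
S_D = np.array([[1.0, 0.5],
                [0.5, 1.0]])
S_M = np.array([[1.0, 0.2, 0.6],
                [0.2, 1.0, 0.1],
                [0.6, 0.1, 1.0]])
F = rls_scores(A, S_D, S_M)
```

Under this assumed form, a disease whose row of A is all zeros receives zero scores from the miRNA-space classifier, so only the disease-space term (weight 1 - w) can rank miRNAs for it; this is one way to see why w is kept below 1.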

LOOCV can be implemented in the following two ways. (1) For the ith disease, each known miRNA associated
with disease i was left out in turn as the test miRNA. The entry F(i,j) in row i, column j of the matrix
F reflects the probability that miRNA j is related to disease i. How well this test miRNA was ranked
relative to the candidate miRNAs was evaluated based on the ith row of the matrix F (seed miRNAs:
the other known miRNAs associated with disease i; candidate miRNAs: all the miRNAs without evidence of an
association with disease i). If the rank of the test miRNA exceeded the given threshold, the model was
considered to have successfully predicted this miRNA-disease association. We call LOOCV implemented in
this way local LOOCV. (2) Unlike local LOOCV, no disease was fixed; all the diseases were considered
simultaneously. Each known disease-miRNA association was left out in turn as the test association,
and how well this test association was ranked relative to the candidate associations was evaluated based
on the matrix F (seed associations: the other known disease-miRNA associations; candidate associations:
all the disease-miRNA pairs without evidence confirming the association). If the rank of the test
association exceeded the given threshold, the model was considered to have successfully predicted this
association. We call LOOCV implemented in this way global LOOCV. The difference between local and global
LOOCV is whether all the diseases are considered simultaneously. Because RWRMDA cannot uncover the missing
associations for all the diseases simultaneously, global LOOCV cannot be implemented for RWRMDA. For HDMP,
global LOOCV can be implemented. As a global predictive approach, RLSMDA can be evaluated in both local
and global LOOCV.
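
The two evaluation schemes differ only in the candidate set against which the held-out association is ranked. The following sketch makes that concrete; it is not the authors' code. The function loocv_ranks, its arguments, and the toy scorer (a similarity-weighted sum that merely stands in for any method producing a score matrix F) are hypothetical.

```python
import numpy as np

def loocv_ranks(A, score_fn, mode="local"):
    """Rank each held-out known association among the candidates.

    A        : (n_diseases, n_mirnas) 0/1 matrix of known associations
    score_fn : callable mapping a training matrix to a score matrix F
    mode     : "local" ranks within the disease row, "global" over all pairs
    Returns a list of (rank, n_total), where rank 1 is best and n_total
    counts the held-out association plus its candidates.
    """
    results = []
    for i, j in zip(*np.nonzero(A)):
        train = A.copy()
        train[i, j] = 0                      # hide the test association
        F = score_fn(train)
        if mode == "local":
            candidates = F[i, A[i] == 0]     # unlabeled miRNAs of disease i
        else:
            candidates = F[A == 0]           # unlabeled pairs of all diseases
        # Rank = 1 + number of candidates scoring strictly higher.
        rank = 1 + int(np.sum(candidates > F[i, j]))
        results.append((rank, candidates.size + 1))
    return results

# Hypothetical stand-in scorer (not RLSMDA) and toy data.
S_D = np.array([[1.0, 0.8],
                [0.8, 1.0]])
A = np.array([[1.0, 1.0, 0.0],
              [1.0, 0.0, 0.0]])
local = loocv_ranks(A, lambda T: S_D @ T, mode="local")
global_ = loocv_ranks(A, lambda T: S_D @ T, mode="global")
```

In local mode the candidate set changes with the disease, whereas in global mode every held-out association competes against the same pool of unlabeled pairs, which is why a method that scores one disease at a time cannot be evaluated globally.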

The receiver operating characteristic (ROC) curve was drawn and the area under the curve (AUC) was
calculated to evaluate the performance of the predictive methods. The ROC curve plots the true positive
rate (sensitivity) versus the false positive rate (1 - specificity) at different thresholds.
Sensitivity refers to the percentage of the test samples whose ranking is higher than a given threshold.
Specificity refers to the percentage of samples that are below the threshold. AUC = 1 indicates perfect
performance and AUC = 0.5 indicates random performance.
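
When each test case contains a single held-out positive, the AUC of that case reduces to the fraction of candidates ranked below the positive. The sketch below averages this fraction over test cases. This per-case average is a common simplification and an assumption here, not necessarily the pooled ROC procedure used in the paper; the name rank_auc is hypothetical.

```python
def rank_auc(results):
    """Mean fraction of candidates ranked below each held-out positive.

    results : iterable of (rank, n_total), where rank 1 is best and n_total
              counts the held-out positive plus its candidates.
    """
    # Cases with no candidates (n_total == 1) carry no ranking information.
    fractions = [(n - rank) / (n - 1) for rank, n in results if n > 1]
    if not fractions:
        raise ValueError("no test case has any candidate to rank against")
    return sum(fractions) / len(fractions)

best = rank_auc([(1, 10)])       # positive ranked first among 10
worst = rank_auc([(10, 10)])     # positive ranked last among 10
middle = rank_auc([(5.5, 10)])   # positive placed at the middle rank
```

A positive ranked first gives 1, one ranked last gives 0, and one placed exactly at the middle rank gives 0.5, matching the interpretation of AUC = 1 as perfect and AUC = 0.5 as random performance.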

According to the literature 47 , the AUC of RWRMDA is 0.8617, which is a significant improvement over the
previous computational method based on the hypergeometric distribution 45 . However, for diseases which
have only 1 known miRNA, LOOCV cannot be implemented. To be fair, we assume that the left-out known
association obtains a random rank in that case, i.e. for N candidate miRNAs, we regard the rank of the
left-out known miRNA as (N + 1)/2. The recalculated AUC for RWRMDA was 0.8473. For global LOOCV,
HDMP obtained an AUC of 0.9431. For RLSMDA, the AUC in local and global LOOCV is 0.8450 and 0.9511,
respectively (see Figure 1). We can conclude that the performance of RLSMDA is comparable to that of
RWRMDA and slightly better than that of HDMP. However, RWRMDA and HDMP cannot predict potential miRNAs for
diseases which do not have known related miRNAs, which is the major defect of these methods. Furthermore,
RWRMDA is a local approach which cannot uncover the missing associations for all the diseases
simultaneously, i.e. we cannot compare the scores between one miRNA and two different diseases. Although
there is no significant improvement in terms of AUC, RLSMDA solves these two problems. Furthermore, HDMP
introduces additional information about miRNA families and clusters, which benefits the performance of
that method. It is likely that the performance of RLSMDA would be further improved by introducing miRNA
family and cluster information into its model. This performance demonstrates that RLSMDA can recover known
experimentally verified miRNA-disease associations and hence has the potential to predict novel
associations.

SCIENTIFIC REPORTS | 4:5501 | DOI: 10.1038/srep05501

Table 1 | The top 50 potential Hepatocellular cancer (HCC) related miRNAs predicted by RLSMDA and the
confirmation of their associations by various databases (1st column: top 1-25; 2nd column: top 26-50).
Forty of the top 50 miRNAs have been confirmed to be related to HCC

Parameter effect. In the above cross validation, we placed more emphasis on the miRNA space classifier
(the classifier based on the miRNA functional similarity dataset) in the final classifier, based on the
fact that miRNA functional similarity plays a critical role in disease-related miRNA prediction. However,
we cannot rely entirely on the results from the miRNA space, because in that case we could not predict
potential miRNAs for diseases which do not have any known related miRNAs. Therefore, we chose the weight
parameter w = 0.9 in the final classifier. We also assigned different weights to the classifier
constructed in the miRNA space and calculated the corresponding AUCs. The results are shown in
Supplementary Figure 1, where it can be observed that a higher weight improves the final performance of
RLSMDA.

Case studies. It has been demonstrated that many miRNAs are associated with various human
cancers 12,13,38,55-57 and that almost half of miRNAs are located in cancer-associated genomic regions or
fragile sites 12,55 . Here, case studies of several important diseases were implemented to evaluate the
independent predictive ability of RLSMDA. Predictive results were confirmed based on the update
of HMDD and the datasets in miR2Disease and dbDEMC.

Hepatocellular cancer (Hepatocellular carcinoma, malignant hepatoma, HCC) is the third leading cause of
cancer deaths worldwide, with over 500,000 people affected
(http://emedicine.medscape.com/article/197319-overview). HCC is the most common type of liver cancer, and
most of the people affected come from Asia and Africa, where the high prevalence of hepatitis B and
hepatitis C strongly contributes to the development of chronic liver disease and HCC
(http://emedicine.medscape.com/article/197319-overview). In the gold-standard data, 34 miRNAs have been
related to the development of HCC. For example, independent experimental observations showed
that the expression of miRNAs let-7e, 125a and 99b was considerably lower in HCC than in normal liver 58 .
MiRNAs without known relevance to HCC were prioritized based on the predictive results of
RLSMDA. Among the top 50 predicted HCC-related miRNAs, 40 miRNAs have been confirmed by the aforementioned
databases. In particular, the top 20 potential miRNAs are all confirmed. The top 50 potential HCC related
miRNAs and the evidence for their associations with HCC are listed in Table 1. The unconfirmed potential
miRNA with the highest rank is miR-34b (ranked 22nd). However, recent findings in the literature 59 showed
that the potentially functional SNP rs4938723 in the promoter region of pri-miR-34b/c may lead to the
development of HCC in the investigated Chinese population, which establishes a connection between HCC and
miR-34b. All the datasets used in this study were generated before the publication of that report.
Therefore, this successful independent literature validation gives further strong support to the reliable
performance of RLSMDA. We did not further check whether the associations between the other unconfirmed
potential miRNAs and HCC can be verified based on recent experimental literature. However, the
performance of RLSMDA in cross validation and in this case study leads us to believe that RLSMDA
can predict more disease-related miRNAs.

In our previous paper about the method of RWRMDA 47 , 98% (Breast cancer), 74% (Colon cancer), and 88%
(Lung cancer) of the top 50 predicted miRNAs were confirmed by published experiments. The predictive
accuracy for Breast cancer and Lung cancer was therefore already satisfactory. Hence, we implemented the
case study on Colon cancer here to see whether RLSMDA can improve on the performance of our previous
method for Colon cancer. Colon cancer is the third most common cancer in the world, and more than
half of the people who die of it come from developed countries
(http://en.wikipedia.org/wiki/Colonic_cancer). Colon cancer usually strikes without symptoms; therefore,
it is important to get a colon cancer screening test. If the colon cancer is found early,
the doctor can use surgery, radiation, and/or chemotherapy for effective treatment
(http://www.webmd.com/colorectal-cancer/default.htm). There are thirty-seven known colon cancer related
miRNAs in the golden standard dataset. For example, miR-200b and miR-141 have been shown to be highly
overexpressed in colon carcinoma 60 . Candidate miRNAs were prioritized in terms of the scores obtained
from RLSMDA. Forty-two of the top fifty predicted colonic cancer related miRNAs have been confirmed by
various databases and literature 12,61-62 . The top 50 potential colonic cancer related miRNAs and the
evidence confirming the associations are listed in Supplementary Table 1. A typical example is miR-18b,
which is ranked 24th in the predictive list. Recent experimental literature confirms its connection to
colonic cancer 62 . In that paper, the expression of miR-18b was upregulated in colonic cancer tissues
compared with the para-cancerous control. Therefore, miR-18b is expected to participate in the process of
colonic cancer and to play a critical role in the carcinogenesis of the colon. As mentioned, the dataset
used in this study for potential miRNA prediction was generated before the publication of that paper.
This second independent validation further supports the performance of RLSMDA.

As mentioned, RLSMDA can reconstruct the missing associations for all the diseases simultaneously. The
top 20 potential disease-miRNA associations predicted by RLSMDA and their confirmation
based on various databases are listed in Table 2. Fifteen of the top 20 potential disease-miRNA
associations have been confirmed. Also, the top 100 potential disease-miRNA associations are shown in
Supplementary Table 2 and were verified based on various databases and literature 12,61 . These 100
potential associations involve various diseases, including breast cancer, colonic cancer, brain cancer,
type 2 diabetes and so on. As a result, 61 of the top 100 potential associations were confirmed.
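
Global prediction amounts to ranking all unlabeled disease-miRNA pairs by their scores in a single list. A minimal sketch of that selection step follows; the function name top_k_candidates and the toy score matrix are hypothetical and only illustrate the masking and sorting, not the authors' implementation.

```python
import numpy as np

def top_k_candidates(F, A, k=20):
    """Return the k highest-scoring disease-miRNA pairs that are not already known.

    F : (n_diseases, n_mirnas) score matrix
    A : (n_diseases, n_mirnas) 0/1 matrix of known associations
    """
    scores = np.where(A == 0, F, -np.inf)           # mask known associations
    flat = np.argsort(scores, axis=None)[::-1][:k]  # best scores first
    rows, cols = np.unravel_index(flat, F.shape)
    # Drop masked pairs in case k exceeds the number of candidates.
    return [(int(i), int(j), float(F[i, j]))
            for i, j in zip(rows, cols) if A[i, j] == 0]

# Toy example with assumed scores: 2 diseases, 3 miRNAs.
F = np.array([[0.9, 0.1, 0.5],
              [0.2, 0.8, 0.3]])
A = np.array([[1, 0, 0],
              [0, 1, 0]])
top2 = top_k_candidates(F, A, k=2)
```

Such a joint ranking is only meaningful when scores are comparable across diseases, which is the property that distinguishes a global method from a per-disease one.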

Applicability of RLSMDA to diseases without any known related microRNAs. To demonstrate that RLSMDA is
applicable to diseases without any known associated miRNAs, we implemented case studies for the diseases
discussed in the above section after removing all the known verified miRNAs which have been shown to be
related to the disease in question. This operation ensured that prioritizing candidate
miRNAs for the given disease only made use of the information of other diseases having known related
miRNAs and of the similarity information. It must be pointed out that we selected the same
candidate miRNA set as in the normal case study for a given disease, i.e. the removed known seed miRNAs
were not regarded as candidate miRNAs.
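
The simulation described above can be sketched as follows: all known associations of one disease are hidden, the scores are recomputed, and only the original candidate miRNAs are ranked. The function rank_without_seeds and the similarity-weighted stand-in scorer are hypothetical; any method that returns a score matrix could be passed as score_fn.

```python
import numpy as np

def rank_without_seeds(A, score_fn, disease):
    """Rank candidate miRNAs for one disease after hiding all its known miRNAs."""
    train = A.copy()
    train[disease, :] = 0        # the disease now has no seed miRNAs
    F = score_fn(train)
    # Same candidate set as the normal case study: removed seeds are excluded.
    candidates = np.flatnonzero(A[disease] == 0)
    order = candidates[np.argsort(-F[disease, candidates], kind="stable")]
    return order.tolist()

# Hypothetical stand-in scorer (not RLSMDA) and toy data: 3 diseases, 4 miRNAs.
S_D = np.array([[1.0, 0.9, 0.1],
                [0.9, 1.0, 0.2],
                [0.1, 0.2, 1.0]])
A = np.array([[1.0, 0.0, 0.0, 0.0],
              [1.0, 1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 0.0]])
ranking = rank_without_seeds(A, lambda T: S_D @ T, disease=0)
```

In the toy example, disease 0 has no training associations left, so its ranking is driven entirely by the miRNAs of the diseases most similar to it.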

For Hepatocellular cancer, we removed the 34 known HCC related miRNAs and prioritized candidate miRNAs
based on the predictive results of RLSMDA. Among the top 50 predictions, 36 miRNAs have been confirmed by
various databases. The top 50 potential HCC related miRNAs obtained when the information about known
HCC related miRNAs is removed, and the evidence for their associations with HCC, are listed in
Supplementary Table 3. The aforementioned independently validated association between HCC and miR-34b was
also ranked in the top 50 of the predictive list. For colon cancer, after removing the 37 known seed
miRNAs, RLSMDA was implemented to uncover potential connections between colon cancer and candidate miRNAs.
As a result, 36 of the top 50 miRNAs are confirmed by various databases and literature 12,61,62 . The top
50 potential miRNAs and the evidence are listed in Supplementary Table 4. Surprisingly, the independently
validated association between miR-18b and colon cancer is ranked 1st by RLSMDA when the known colon cancer
related miRNAs are removed.

In addition to the above simulation experiments, RLSMDA was also applied to diseases without any known
related miRNAs in our golden standard dataset. In this case, when we prioritize candidate miRNAs
for the given disease, only the disease-miRNA associations of other diseases and the similarity
information between these diseases are used. The prediction results were verified based on recent
experimental literature. As a result, among the top 3 potentially related miRNAs predicted by RLSMDA for
the 32 diseases investigated here, 34 disease-miRNA associations were successfully confirmed by biological
experiments 63-95 (see Table 3).

For example, hsa-mir-21 has been shown to play a critical role in various cellular processes including
maturation, migration, proliferation, and survival. Accumulated evidence has linked mir-21 to many complex
human diseases, and its associations with many diseases, such as Breast cancer, Brain cancer, Lung cancer,
and Stomach cancer, have been collected in the golden standard dataset. Here, we predicted mir-21 as the
most likely related miRNA for Abdominal Aortic Aneurysm (AAA), Thoracic Aortic Aneurysm (TAA), Sezary
Syndrome (SS), and Vascular Diseases. These predictions were all confirmed by biological experiments.
Maegdefessel et al. identified mir-21 as a key modulator of proliferation and apoptosis of vascular wall
smooth muscle cells during development of AAA and provided a new therapeutic pathway that could be
targeted to treat AAA 95 . Jones et al. observed decreased expression of mir-21 in TAA compared to normal
aortic samples and further identified a significant relationship between its expression level and aortic
diameter 65 . Narducci et al. profiled the expression of miRNAs in a cohort of 22 SS patients and
identified differential expression of mir-21 between SS and controls 75 . Cheng and Zhang pointed out that
mir-21 plays important roles in biological processes such as vascular smooth muscle cell proliferation and
apoptosis, cardiac cell growth and death, and cardiac fibroblast functions. Furthermore, they showed that
mir-21 is involved in the pathogenesis of cardiovascular diseases 76 . These successful predictive examples
demonstrate that RLSMDA has the potential to provide high-quality disease-miRNA associations for diseases
without any known related miRNAs, which addresses a critical deficiency of the previous methods.

Table 3 | Confirmed disease-miRNA associations predicted by RLSMDA for diseases without known related
miRNAs in our golden standard data set

Ranking  Disease                                           miRNA          PMID
1        Acute Coronary Syndrome                           hsa-mir-1      [illegible]
1        Aortic Aneurysm, Abdominal                        hsa-mir-21     [illegible]
1        Aortic Aneurysm, Thoracic                         hsa-mir-21     [illegible]
1        Arthritis, Psoriatic                              hsa-mir-146a   [missing]
1        Crohn Disease                                     hsa-mir-16     [illegible]
1        Laryngeal Neoplasms                               hsa-mir-205    [illegible]
1        Leukemia, Myelogenous, Chronic, BCR-ABL Positive  hsa-mir-181a   [illegible]
1        Liver Failure                                     hsa-mir-221    [illegible]
1        Lupus Erythematosus, Systemic                     hsa-mir-146a   [illegible]
1        Mesothelioma                                      hsa-mir-18a    [illegible]
1        Osteosarcoma                                      hsa-mir-15a    [illegible]
1        Retinoblastoma                                    hsa-mir-181b   [illegible]
1        Sezary Syndrome                                   hsa-mir-21     [illegible]
1        Vascular Diseases                                 hsa-mir-21     [illegible]
2        Amyloidosis                                       hsa-mir-16     [illegible]
2        Antiphospholipid Syndrome                         hsa-mir-20a    [illegible]
2        Aortic Valve Stenosis                             hsa-mir-21     [illegible]
2        Atrial Fibrillation                               hsa-mir-223    [illegible]
2        Creutzfeldt-Jakob Syndrome                        hsa-mir-146a   [missing]
2        Endometrial Neoplasms                             hsa-mir-194    [illegible]
2        Huntington Disease                                hsa-mir-200c   [illegible]
2        Lichen Planus, Oral                               hsa-mir-21     [illegible]
2        Mesothelioma                                      hsa-mir-20a    [illegible]
2        Lymphoma, Non-Hodgkin                             hsa-mir-21     [illegible]
2        Osteosarcoma                                      hsa-mir-16     [illegible]
3        Colitis, Ulcerative                               hsa-mir-143    [illegible]
3        Cystic Fibrosis                                   hsa-mir-155    [illegible]
3        Endometrial Neoplasms                             hsa-mir-155    [illegible]
3        Fibrosis                                          hsa-mir-29c    21784902
3        Hyperlipidemias                                   hsa-mir-122    22587332
3        Keratoconus                                       hsa-mir-184    21996275
3        Mycosis Fungoides                                 hsa-let-7a     21966986
3        Neoplasms, Squamous Cell                          hsa-mir-181a   21244495
3        Osteoporosis                                      hsa-mir-133a   22506038

Predicting novel human miRNA-disease associations. Having confirmed
the reliable performance of RLSMDA in terms of cross validation and
case studies, we further applied it to predict potential human
disease-miRNA associations. All the known disease-miRNA associations
in the gold-standard dataset were used as positive samples. We
publicly released the list of potential human disease-miRNA
associations to facilitate biological experimental validation (see
Supplementary Table 5). It is anticipated that the potential
disease-miRNA associations predicted here will be validated by
further biological experiments and prove useful for biomedical
research.

Discussions 

Identifying potential disease-miRNA associations is critical for
understanding the pathogenesis of disease at the miRNA level and
further improving human medicine. In this paper, RLSMDA was
developed to identify disease-related miRNAs by integrating
disease-disease semantic similarity information, miRNA-miRNA
functional similarity information, and known human miRNA-disease
associations on a large scale. RLSMDA was motivated by the framework
of regularized least squares and the basic assumption that
functionally related miRNAs tend to be related to phenotypically
similar diseases. Compared with previous methods, RLSMDA can identify
related miRNAs for diseases without any known associated miRNAs.
Furthermore, RLSMDA does not need negative sample selection and
reconstructs the missing associations for all the diseases
simultaneously. Cross validation and case studies of Hepatocellular
cancer and Lung cancer have demonstrated the reliable performance of
RLSMDA. Furthermore, we implemented simulated case studies for
Hepatocellular cancer and Lung cancer after removing all the known
verified miRNAs that have been shown to be related to the disease in
question. Many prediction results were confirmed by various databases
and literature. More importantly, when we applied RLSMDA to diseases
without any known related miRNAs in our gold standard dataset, 34
disease-miRNA associations, ranked in the top 3 of the potential
related miRNA lists predicted by RLSMDA for the 32 diseases
investigated here, were successfully confirmed by biological
experiments.

Figure 2 | The basic idea of disease semantic similarity calculation.
[Flowchart: for Disease A (Breast Neoplasms) and Disease B (Lung
Neoplasms), calculate the contribution of each disease t in DAG(A) and
DAG(B) to the semantics of A and B, calculate the semantic values of A
and B, then calculate the semantic similarity between A and B. Disease
pairs sharing a larger part of their DAGs are more similar.]

These examples demonstrate that RLSMDA is applicable to diseases
without any known associated miRNAs. Because RLSMDA can reconstruct
the missing associations for all the diseases simultaneously, we
also used it to implement global prediction across all diseases. As
a result, 15 of the top 20 potential disease-miRNA associations have
been confirmed. Moreover, 61 of the top 100 potential disease-miRNA
associations were confirmed, involving various diseases including
breast cancer, colonic cancer, brain cancer, and type 2 diabetes. We
publicly released potential miRNA lists for the 137 diseases
investigated in this paper to guide biological experiments. It is
anticipated that RLSMDA will be a useful resource for research on
the associations between miRNAs and human diseases.

The reliable performance of RLSMDA can largely be attributed to
three factors. Firstly, heterogeneous datasets (known disease-miRNA
associations, miRNA functional similarity, and disease semantic
similarity) were integrated to capture the potential associations
between diseases and miRNAs. In particular, RLSMDA can predict
potentially related miRNAs for diseases without any known associated
miRNAs by introducing disease similarity information. Secondly,
RLSMDA is a semi-supervised method, which overcomes the difficulty
of obtaining negative disease-miRNA association samples in practice.
Finally, RLSMDA is a global approach, which can predict the scores
between miRNAs and diseases for all the diseases simultaneously.
These three success factors also constitute the novelties of
RLSMDA. Hence, RLSMDA represents a novel, useful, and important
biomedical resource for miRNA-disease association identification.

Although the development of RLSMDA includes several important
novelties, some limitations remain. Firstly, how to choose the
parameter values in RLSMDA is still not well solved. In particular,
we need to integrate the predictive results from the disease space
and the miRNA space by weight parameters. How to directly obtain a
single classifier, or how to reasonably integrate results from the
different spaces, is a critical problem for future research.
Secondly, more reliable construction of disease similarity and miRNA
similarity would further improve the predictive ability. We plan to
integrate more biologically relevant information to define miRNA
similarity and disease similarity. Thirdly, more experimentally
verified human disease-miRNA associations would promote the
development and improve the performance of computational methods for
identifying human disease-miRNA associations.

Figure 3 | The flowchart of RLSMDA includes three steps: solving optimization problem; obtaining the optimal classifier in the disease and miRNA
space, respectively; combining classifiers in the disease and miRNA space to obtain final predictive result.
[Flowchart: in the miRNA space and in the disease space, the classification function complies with the known disease-related miRNA
information and is smooth over that space; the optimized classification functions of the two spaces are then combined.]



Methods 

Human miRNA-disease associations. The human miRNA-disease association
dataset used as the gold standard in this paper was downloaded from the
supplementary material of reference 96 (obtained from HMDD in September 2009).
Because we wanted to confirm our prediction lists against later updates of HMDD and
the datasets in other databases, we did not use the newest association dataset in HMDD
or the datasets in other databases for prediction. The gold standard includes 1616
distinct high-quality experimentally verified human miRNA-disease associations.
After operations such as merging different miRNA copies that
produce the same mature miRNA and unifying the names of mature miRNAs and
diseases, 1395 miRNA-disease associations, involving 271 miRNAs and 137 diseases,
were used in this paper (see Supplementary table 6). We use nd to denote the number of
diseases and nm the number of miRNAs. Matrix A denotes the adjacency
matrix of disease-miRNA associations, where the entry A(i,j) in row i and column j is 1 if
miRNA j is related to disease i, and 0 otherwise.
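
The construction of A can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `build_adjacency`, the disease and miRNA labels, and the association pairs are all made up for the example and are not taken from the gold standard dataset.

```python
import numpy as np

def build_adjacency(pairs, diseases, mirnas):
    """Return the nd x nm 0/1 matrix A with A[i, j] = 1 if miRNA j is linked to disease i."""
    d_index = {d: i for i, d in enumerate(diseases)}
    m_index = {m: j for j, m in enumerate(mirnas)}
    A = np.zeros((len(diseases), len(mirnas)), dtype=int)
    for disease, mirna in pairs:
        # Repeated pairs simply set the same entry to 1 again.
        A[d_index[disease], m_index[mirna]] = 1
    return A

# Hypothetical toy input (not real associations).
diseases = ["disease_1", "disease_2"]
mirnas = ["mirna_a", "mirna_b", "mirna_c"]
pairs = [("disease_1", "mirna_a"), ("disease_2", "mirna_c"), ("disease_1", "mirna_a")]
A = build_adjacency(pairs, diseases, mirnas)
```

Rows index diseases and columns index miRNAs, matching the convention A(i,j) used in the text.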

MiRNA functional similarity. In reference 96, the functional similarity score for each
miRNA pair was calculated based on the assumption that miRNAs with similar
functions tend to be related to similar diseases. We downloaded the miRNA
functional similarity scores from http://cmbi.bjmu.edu.cn/misim/ in January 2010
(see Supplementary table 7). Matrix SM denotes the miRNA functional
similarity matrix, where the entry SM(i,j) in row i and column j is the functional similarity
between miRNAs i and j. The miRNA functional similarity used here has previously been used to
predict disease-related miRNAs and environmental factor-miRNA
interactions, with excellent performance 47,97 .

Disease semantic similarity. Here, we calculated the disease similarity in the same
way as reference 96. The basic idea of disease semantic similarity calculation is
illustrated in Figure 2. The relationships between diseases can be obtained from the MeSH
database (http://www.ncbi.nlm.nih.gov/), which provides a strict system for disease
classification. A disease can be described as a directed acyclic graph (DAG), where the nodes
represent the disease itself and its ancestor diseases, and the link from a parent node to a child node
represents the relationship between these two nodes. For example, disease A can be
described as a graph DAG(A) = (A, T(A), E(A)), where T(A) is the node set including
node A itself and all ancestor nodes of A, and E(A) is the corresponding link set. The
contribution of disease t in DAG(A) to the semantics of disease A is defined as follows:



$$
D_A(t) =
\begin{cases}
1 & \text{if } t = A \\
\max\{\Delta \cdot D_A(t') \mid t' \in \text{children of } t\} & \text{if } t \neq A
\end{cases}
\tag{1}
$$

where $\Delta$ is the semantic contribution factor. The contribution of disease A to its own
semantic value is one, while the contributions of other ancestor diseases to the
semantic value of disease A decrease with the distance between the ancestor and
disease A. Therefore, we can define the semantic value of disease A based on the
contributions of its ancestor diseases and of disease A itself, i.e.



$$
DV(A) = \sum_{t \in T(A)} D_A(t). \tag{2}
$$



Based on the assumption that disease pairs sharing a larger part of their DAGs are more
similar, we defined the semantic similarity between two diseases A and B as follows:



$$
SD(A,B) = \frac{\sum_{t \in T(A) \cap T(B)} \left( D_A(t) + D_B(t) \right)}{DV(A) + DV(B)}. \tag{3}
$$



Matrix SD denotes the disease semantic similarity matrix, where the entry
SD(i,j) in row i and column j is the semantic similarity between diseases i and j (see
Supplementary table 8).
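
As a rough illustration of Equations (1) to (3), the following Python sketch computes the semantic similarity on a tiny hand-made DAG. Everything concrete here is an assumption made for the example: the node names, the `parents` mapping (child to list of parents), the function names, and the value 0.5 chosen for the semantic contribution factor.

```python
def semantic_contributions(disease, parents, delta):
    """D_A(t) for every t in T(A): 1 for A itself, otherwise the maximum over
    the children t' of t inside DAG(A) of delta * D_A(t')."""
    contrib = {disease: 1.0}
    frontier = [disease]
    while frontier:
        child = frontier.pop()
        for parent in parents.get(child, ()):
            value = delta * contrib[child]
            # Keep the largest contribution reaching this ancestor.
            if value > contrib.get(parent, 0.0):
                contrib[parent] = value
                frontier.append(parent)
    return contrib

def semantic_similarity(a, b, parents, delta):
    """SD(A, B): shared contributions divided by DV(A) + DV(B)."""
    ca = semantic_contributions(a, parents, delta)
    cb = semantic_contributions(b, parents, delta)
    shared = sum(ca[t] + cb[t] for t in ca.keys() & cb.keys())
    return shared / (sum(ca.values()) + sum(cb.values()))

# Hypothetical toy hierarchy: A and B share the parent P, whose parent is root.
parents = {"A": ["P"], "B": ["P"], "P": ["root"]}
delta = 0.5  # assumed value, chosen only for this example
sd_ab = semantic_similarity("A", "B", parents, delta)
sd_aa = semantic_similarity("A", "A", parents, delta)
```

In this toy case the two diseases share the ancestors P and root but not each other, so their similarity is below that of a disease with itself.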
Regularized Least Squares for MiRNA-Disease Association (RLSMDA). Based on
the underlying assumption that miRNAs associated with more similar diseases are
more similar, and vice versa, we developed the method of Regularized Least
Squares for MiRNA-Disease Association (RLSMDA) to uncover the potential
miRNAs associated with various diseases (see Figure 3). RLSMDA is designed to
construct a continuous classification function that reflects the probability that
each miRNA is related to a given disease. We want the function to meet the
following two criteria: (1) it complies with the known disease-related miRNA
information; (2) it is smooth over the miRNA space and the disease space, i.e. for a given
disease (miRNA), similar miRNAs (diseases) obtain similar scores, which matches
the basic assumption of our method. Considering the difficulty of obtaining
negative samples, a semi-supervised classifier is constructed under the framework of
Regularized Least Squares (RLS) by defining a cost function and
minimizing it. Cost functions can be developed in the miRNA space and the
disease space, respectively. Taking the miRNA space as an example, the optimal
classification function can be obtained by solving the following optimization
problem:

$$
\min_{F_M} \left[ \left\| A^T - F_M \right\|_F^2 + \eta_M \left\| F_M \cdot SM \cdot F_M^T \right\|_F^2 \right] \tag{4}
$$

where $\|\cdot\|_F$ is the Frobenius norm and $\eta_M$ is the trade-off parameter. The solution of
this optimization problem is:

$$
F_M^* = SM \cdot (SM + \eta_M \cdot I_M)^{-1} \cdot A^T \tag{5}
$$

where $I_M$ is the identity matrix with the same size as matrix SM.

In a similar way, we can obtain the optimal classification function in the disease
space as follows:

$$
F_D^* = SD \cdot (SD + \eta_D \cdot I_D)^{-1} \cdot A \tag{6}
$$

where $I_D$ is the identity matrix with the same size as matrix SD.

Finally, the optimal classifiers in the two spaces are combined to give the
final solution by a simple weighted average, i.e.

$$
F^* = w \cdot (F_M^*)^T + (1 - w) \cdot F_D^* \tag{7}
$$

where the entry $F^*(i,j)$ in row i and column j reflects the probability that miRNA j is related
to disease i.
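
The closed-form solutions in the two spaces and their weighted combination can be sketched with NumPy as below. This is a minimal sketch rather than the authors' implementation: the function names, the toy matrices, and the parameter values (both trade-off parameters set to 1.0 and w set to 0.5) are assumptions made for illustration. A linear solve is used in place of an explicit matrix inverse, which is a numerical convenience and not something the text prescribes.

```python
import numpy as np

def rls_closed_form(S, Y, eta):
    """Return S (S + eta*I)^{-1} Y, the closed-form score matrix in one space."""
    identity = np.eye(S.shape[0])
    # Solve (S + eta*I) X = Y rather than forming the inverse explicitly.
    return S @ np.linalg.solve(S + eta * identity, Y)

def rlsmda_scores(A, SM, SD, eta_m, eta_d, w):
    """Combine miRNA-space and disease-space scores into an nd x nm matrix."""
    F_M = rls_closed_form(SM, A.T, eta_m)  # nm x nd, miRNA space
    F_D = rls_closed_form(SD, A, eta_d)    # nd x nm, disease space
    return w * F_M.T + (1.0 - w) * F_D

# Toy inputs, invented for illustration: 2 diseases, 3 miRNAs.
A = np.array([[1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0]])
SM = np.array([[1.0, 0.8, 0.1],
               [0.8, 1.0, 0.2],
               [0.1, 0.2, 1.0]])
SD = np.array([[1.0, 0.3],
               [0.3, 1.0]])
F = rlsmda_scores(A, SM, SD, eta_m=1.0, eta_d=1.0, w=0.5)

# Rank the miRNAs with no known link to disease 0 by descending score.
unknown = np.flatnonzero(A[0] == 0)
ranked = unknown[np.argsort(-F[0, unknown])]
```

Because the score matrix covers every disease row at once, the same ranking step can be repeated for each disease, or applied to all unknown entries together for a global ranking.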